Using Database Principles to Optimize SAS® Format Construction from Tabular Data
نویسنده
چکیده
While formats are more efficient than standard lookup tables for retrieving target values, guidelines still need to be established for building the best possible format from a control data set. More specifically, the guidelines need to link the format as a function where label=f(range) to the format's underlying table as a two-dimensional relation where entities and attributes are identified by primary keys. Fortunately, database cardinality and normalization concepts can be expanded to embrace these seemingly disparate constructs. An identity is established between a mathematic "function" and database theory's "first normal form" required for format construction. The paper then shows what happens when formats are built that either conform to or fail to meet requirements for second and third normal forms. Terms are defined and concepts are explained by diagram and example. An additional section deals with building a cluster of formats that reflect the structure of FDA data used for generating the segmented National Drug Code. Insight into database concepts should help users optimize format construction from two-dimensional tables. The twodimensional model is also extended to cover the generation of multiple formats for data that gradually change over time. Since most medical codes are updated annually, this paper is particularly relevant for those who work in the pharmaceutical industry. INTRODUCTION With the addition of significant functionality, PROC FORMAT has become a major component in the SAS programming language. A major improvement dates back to the early 1990's when SAS made it possible to convert a SAS data set to a format. Initially the data sets contained two columns: one for the format range and a second for a corresponding label. Subsequently, SAS users extended the source data to encompass two-dimensional tables of related attributes containing both row and column identifiers. With an underlying tabular structure, database management techniques have become increasingly relevant in format construction and management. Two techniques are cardinality and normalization. Cardinality provides insight about how entities with identifying keys relate to each other, and normalization describes a series of steps or normal forms that result in tables with a simplified, robust structure. Since going from table to format is a minor step, this paper shows how normalization techniques translate to formats that are efficient as well as easy to process and maintain. Key database, mathematical and SAS terms are defined. The definitions should enable those unfamiliar with database principles or mathematical notation to understand the concepts that are being discussed. However, a basic familiarity with the FORMAT procedure is assumed. This means no special attempt will be made to explain syntax or to provide instructions for building a format from a control data set. For an excellent introductory guide to the FORMAT procedure, see The Power of PROC FORMAT by Jonas Bilenas. Published in 2005 by SAS Press, it is part of Art Carpenter's SAS® Software Series. A SUGI paper, SAS® Software Formats: Going Beneath the Surface by Roger Staum, provides the mathematical foundation upon which this paper is based. FIRST NORMAL FORM (1NF): REQUIRED FOR CONSTRUCTING A FORMAT RELEVANT TERMINOLOGY FOR FIRST NORMAL FORM
منابع مشابه
Using SAS Software to Analyze Sybase Performance on the Web
This paper provides a web-based system using SAS, HTML and CGI/PERL to provide rudimentary and complex Sybase DBMS performance metrics for Unix based system operations. Sybase SQL Server performance data is collected by Sybase Historical Server allowing for the collection of performance information with minimal impact on the server. The SAS System (Base SAS, Macro, STAT and SAS/Graph) is especi...
متن کاملDeveloping a National Health and Nutrition Examination Survey (NHANES) SAS®/Web-based Quality Assurance Application
The National Health and Nutrition Examination Survey (NHANES) is a comprehensive survey which collects data on the health and nutritional status of the United States population through personal interviews and an extensive physical examination in a Mobile Examination Center (MEC). Continuous quality assurance (QA) is used to ensure high quality survey data. One aspect of QA is the data review pr...
متن کامل119-2010: Blistering ETL Performance Using the Intelligent, Dynamic, and Parallel Capabilities of SAS®
ETL from billion-record+ databases is a non-trivial task. By using the unique capabilities of SAS® to analyze the database in advance and then dynamically generating parallel SAS® jobs based on this interrogation, SAS® can effectively performance-tune itself each run for maximum benefit. Get in, get out, get the results. The net effect is to get the fastest possible results with the minimum win...
متن کاملAutomatically Converting Tabular Data to Rdf: an Ontological Approach
Information residing in relational databases and delimited file systems are inadequate for reuse and sharing over the web. These file systems do not adhere to commonly set principles for maintaining data harmony. Due to these reasons, the resources have been suffering from lack of uniformity, heterogeneity as well as redundancy throughout the web. Ontologies have been widely used for solving su...
متن کاملConstructing a Data Warehouse for Pharmacokinetic Data
The construction of a warehouse using data from DuPont Pharmaceutical’s Clinical Pharmacokinetics Group with the SAS/Warehouse Administrator product is described herein. Data sources incorporated into the pharmacokinetic warehouse existed in multiple formats (Oracle, Excel, SAS data sets, etc) and were represented by numerous individual files, rather then few large files. In most cases, the da...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006